When and How Can Data be Efficiently Released with Privacy?
Abstract
We consider private data analysis in the setting in which a trusted and trustworthy curator, having obtained a large data set containing private information, releases to the public a “sanitization” of the data set that simultaneously protects the privacy of the individual contributors of data and offers utility to the data analyst. The sanitization may be in the form of an arbitrary data structure, accompanied by a computational procedure for determining approximate answers to queries on the original data set, or it may be a “synthetic data set” consisting of data items drawn from the same universe as items in the original data set; queries are carried out as if the synthetic data set were the actual input. In either case the process is non-interactive: once the sanitization has been released, the original data and the curator play no further role.
Blum et al. (STOC ’08) showed the remarkable result that, for any set X of potential data items and any “concept” class C of functions f : X → {0, 1}, the exponential mechanism of McSherry and Talwar (FOCS ’07) can be used to (inefficiently) generate a synthetic data set that maintains approximately correct fractional counts for all concepts in C, while ensuring a strong privacy guarantee.
In this work we investigate the computational complexity of non-interactive privacy mechanisms, mapping the boundary between feasibility and infeasibility. Let κ be a computation parameter. We show:
1. When |C| and |X| are both polynomial in κ, it is possible to efficiently (in κ) and privately construct synthetic data sets maintaining approximately correct counts, even when the original data set is very small (roughly, $O(2^{\sqrt{\log |C|}} \log |X|)$).
2. When either |C| or |X| is superpolynomial in κ, there exist distributions on data sets and a choice of C for which, assuming the existence of one-way functions, there is no efficient private construction of a synthetic data set maintaining approximately correct counts.
3. Turning to the potentially easier problem of privately generating a data structure from which it is possible to approximate counts, there is a tight connection between hardness of sanitization and the existence of traitor tracing schemes, a method of content distribution in which (short) key strings are assigned to subscribers in such a way that, given any useful “pirate” key string constructed by a coalition of malicious subscribers, it is possible to identify at least one colluder. Using known schemes, we obtain a distribution on databases and a concept class permitting inefficient sanitization (via Blum et al.), but for which no efficient sanitization is possible (under certain complexity assumptions).
Our algorithmic results give, for the first time, insight into the value of synthetic data sets beyond that of general data structures.
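To make the inefficient baseline concrete, the following is a minimal Python sketch of how the exponential mechanism could select a synthetic data set, in the spirit of the Blum et al. result cited above. It is an illustration under stated assumptions, not the construction from any of the cited papers: the function names, the tiny universe, the synthetic data set size m, and the choice of score (negative worst-case error in fractional counts, with sensitivity 1/n) are all assumptions made here. The sketch enumerates every candidate multiset of size m, which is exactly why this approach is not efficient when |X| is large.

```python
import itertools
import math
import random


def fractional_count(concept, dataset):
    """Fraction of items in `dataset` on which `concept` evaluates to 1."""
    return sum(concept(x) for x in dataset) / len(dataset)


def max_error(concepts, original, candidate):
    """Worst-case gap in fractional counts between two data sets over all concepts."""
    return max(
        abs(fractional_count(c, original) - fractional_count(c, candidate))
        for c in concepts
    )


def exponential_mechanism_synthetic(original, universe, concepts, m, epsilon, rng):
    """Sample a size-m synthetic data set with probability proportional to
    exp(-epsilon * n * error / 2), where n = len(original).

    Assumption: the score is -max_error, whose sensitivity is 1/n because
    changing one row moves each fractional count by at most 1/n.
    """
    n = len(original)
    # Enumerate every multiset of m universe items (exponential in general).
    candidates = list(itertools.combinations_with_replacement(universe, m))
    errors = [max_error(concepts, original, cand) for cand in candidates]
    # Subtracting the best error rescales all weights by the same factor,
    # so the sampling distribution is unchanged but underflow is avoided.
    best = min(errors)
    weights = [math.exp(-epsilon * n * (err - best) / 2) for err in errors]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    universe = [0, 1, 2, 3]  # hypothetical universe X: all 2-bit items
    concepts = [lambda x: x & 1, lambda x: (x >> 1) & 1]  # hypothetical class C
    original = [0, 3, 3, 0, 1, 3, 2, 3]
    synthetic = exponential_mechanism_synthetic(
        original, universe, concepts, m=4, epsilon=1.0, rng=random.Random(0)
    )
    print(synthetic, max_error(concepts, original, synthetic))
```

Smaller values of epsilon flatten the weights and so give stronger privacy at the cost of accuracy; larger values concentrate the sample on candidates whose counts closely match the original data.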